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Abstract 

In this article a flexible Bayesian non-parametric model is proposed for non- 
homogeneous hidden Markov models. The model is developed through the amal- 
gamation of the ideas of hidden Markov models and predictor dependent stick- 
breaking processes. Computation is carried out using auxiliary variable represen- 
tation of the model which enable us to perform exact MCMC sampling from the 
posterior. Furthermore, the model is extended to the situation when the predictors 
can simultaneously influence the transition dynamics of the hidden states as well 
as the emission distribution. Estimates of few steps ahead conditional predictive 
distributions of the response have been used as performance diagnostics for these 
models. The proposed methodology is illustrated through simulation experiments 
as well as analysis of a real data set concerned with the prediction of rainfall induced 
malaria epidemics. 

Key Words: Bayesian non-parametric mixture models, Conditionally varying 
density estimation, Non-homogeneous hidden Markov models, MCMC sampling, 
Slice sampling, Epidemic prediction. 



1 Introduction 



Hidden Markov models (HMMs), albeit considered to be the simplest forms of Bayesian 
networks, have been tremendously successful in statistical modeling of sequentially gen- 



erated data with applications in many different areas like speech recognition (Rabiner 



1989 Fox et al., 2008), proteomics (Bae et a/. 2005; Lennox et al, 2010), genetics and 



genomics (Guha et al. 2008 Yau et al, 2011), economics and finance (Hamilton, 1990 
Albert and Chib[ |1993b| ). |Rabiner| Ql989[ ); |Scott| fl2002) ; |Yoon] p009| have provided some 
excellent reviews on HMMs and their applications. 

Basic HMM consists of two processes: a hidden process, which evolves according to 
a first order Markov chain, and an observed process, which is conditionally temporally 
independent conditioned on the hidden state. Let yw denote potentially multivariate 
random variables observed sequentially over discrete time points t — 1, 2, . . . , T. Let Z\-t 
denote the associated sequence of hidden states. The HMM makes the following set of 
conditional independence assumptions to model the hidden and the observed processes 



P{z t \z 1:t -i) = P(z t \z t -i), 



P{vt\vi*-i, zitt-i) = P(,yt\z t )- 

Thus it allows the following factorization of the joint distribution of (yi-.r, z i-.t) 

T 

P(yi:T,z 1:T ) = P (z l )P(y 1 \z 1 )l[P(z t \z t ^)P(y t \z t ), 

t=2 

where Pq denotes the distribution of the initial hidden variable Z\. The observations yt's 
are usually assumed to have been generated according to some parametric probability 
law yt\z t ~ F(9 Zt ). In the context of HMMs, the family of distributions {F(8i)}i is 
referred to as the family of emission distributions and the indexing parameters {8{}i are 
known as the emission parameters. The conditional distributions P(z t \zt-i) governing 
the evolution of the latent sequence {zt}t over time are known as transition distributions. 




Figure 1: Graphical representation of an HMM. Unfilled and shaded nodes signify latent 
and observable variables respectively. 



A non-homogeneous hidden Markov model (NHMM) extends this idea by allowing 
the transition distribution of the hidden states to be dependent on a set of observed 
covariates (Hudges et al, 1999 Shirley et al., 2010). Denoting the observable sequence of 



potentially multivariate input variables by Xut the conditional independence assumptions 
for an NHMM can be stated as 



P0t|zi : ( t _i),zi ;t ) = P(z t \z t - 1 ,x t ), 
P{Vt\zx: t ) = P{y t \z t ), 

with the initial distribution now given by z\\x\ ~ Pq(xi). In this situation, the covariates 
influences the transition dynamics of the latent state variables but given z% the emission 
distribution does not depend on the value of the covariate. This model is called NHMM1 
in this article. 

This model can be further generalized when both the transition and the emission 
distributions depend on the observed covariates. In this frame work, the modeling as- 
sumptions become 

P(^|2i :(t _i),Z W ) = P(z t \z t -.x,x t ), 
P(y t \z 1:t ,x 1:t ) = P(y t \z t ,x t ), 

and this extended model will be referred to as NHMM2. 

Direct parallels can be drawn between HMMs and general mixture models. Building 
on ideas of infinite dimensional mixture models, the hierarchical Dirichlet process based 



HDP-HMM (or iHMM) developed by Teh et al. (2006) paved the way for full Bayesian 



non-parametric analysis of HMMs. Subsequently, significant extensions of these models 
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Figure 2: The two types of NHMM considered in this article, (a) NHMM1: the case when 
the predictors influence only the transition dynamics, (b) NHMM2: the more general case 
when the predictors have direct influence on the emission distribution as well. 



have been proposed by Van Gael et al. (2009); Fox et al. (2008). Recently in the context 



of HMM, a Bayesian non-parametric approach was taken by Yau et al. (2011) to model 



genomic copy number variations in array CGH data in the absence of any covariates. 
They considered an HMM with five states to model a mean function varying slowly over 
time. The residuals were modeled non-parametrically using Dirichiet process mixture. 



Taddy and Kottas (2009) also considered HMM and NHMM models with a finite number 



of known states and conditionally, given the hidden state, non-parametrically modeled 
the regression relationship between the predictors and the response. 

In this article, however, we develop non-parametric models to describe the influence 
of time and predictors on the dynamics of transition of the hidden states. A predictor 
dependent probit stick-breaking process has been used to model the transition dynamics 
of the NHMM as described in the next section. Primary emphasis of this article is on 
the efficient estimation of a few steps ahead predictive densities. The article is organized 
as follows. Section [2] introduces the model. An exact sampling algorithm for posterior 
computation is developed in Section |3j Section [4] discusses the prediction mechanism. 



A simulation study is presented in Section |5.1[ and an epidemiological application is 
described in Section 15.21 The article concludes with a discussion section. 



2 Model Specification 

An HMM can be treated as a mixture model where the mixing distribution is a Markov 
chain. To motivate our model, we, therefore, start with a brief review of Bayesian non- 
parametric mixtures models. In a mixture model, y\- n satisfies the conditional indepen- 
dence relation 

P(yi\y_i, z 1:n ) = P{yi\zi), 

where y-i denotes the set of all observations excluding the i-th one and z\._ n are associated 
hidden component labels, Zi = k implying that the i-th observation comes from the k-th 
mixture component. Component specific parameters, indexed by Zi and additional global 
parameters, if any, are kept implicit. In Bayesian non-parametric literature mixture 
labels Zi's are assigned a prior having countably infinite support {1,2,...} with random 
probability weights associated with its atoms. Flexibility and richness aside, increasing 
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popularity of these models can be attributed largely to the development of sophisticated 
computational machinery that has made implementation of these techniques routine in 
various applied problems. 

The most celebrated of this type of priors is perhaps the Dirichlet process (DP) prior 
(Ferguson, 1973 Lo, 1984 Escobar and West, 1995). A more general class of infinite 



dimensional mixing distributions, that includes the DP as a special case, is the class of 



stick-breaking priors (SBP) (Sethuraman, 1994; Ishwaran and James, 2001) 



P(zi) = S^khjzi), 



k=i 



where %k — u fcIIj=i(l ~~ v i) whh tti = i>i and v^'s are independent random variables 
taking values in the unit interval and 5 k denotes a point mass at k. Hyper priors on 
parameters governing the distribution(s) of v^s allow data to have more influence on the 
posterior. Predictor dependent random mixing distributions {P(x) : x G X}, where 
X denotes the sample space of an associated, possibly multivariate predictor variable x, 
can be constructed allowing the v k s to depend on x. Proposals have been plenty, most 



of them leading to challenging computation (Griffin and Steel, 2006; Dunson and Park 



2008 Chung and Dunson, 2008). Probit stick-breaking processes (PSBP) of Rodriguez 
and Dunson ( |2011 ), a sub-family of SBP as the nomenclature suggests, is obtained by 
setting Vk = $(%), probit transformation of some underlying random variable rjk that 
may depend on an associated predictor, if present. One such specification that admits 



easy posterior computation is given by Chung and Dunson (2009) who modeled Vk(x) = 
^( a k — Ym=i h,i\ x i~ x l i\) > where {x\ t } are chosen from a finite set of grid points covering 
the range of values of xi, the Z-th component of the p-variate predictor x. 
In this spirit, we propose a flexible model for NHMM as 



y t \x t , xjj, {Oj}f =l , z t ~ f(y t ; x t , xjj, Z 

6 3 ~Po(0j), 
lf> ~p (VO; 

z t \z t -i = j,x t ~ P(j,x t ), 

oo 



k=l 



k-l 



7TA: 



(j, x t ) = $ (a jk + (3 k h{x t ; x* k )) JJ { 1 - $ (a fl + Pih(x t ; x^)) } , 



i=i 



p (a>,(3,x*) =p (cx) po((3) po(x*). 



(2.1) 

(2.2) 
(2.3) 
(2.4) 

(2.5) 

(2.6) 
(2.7) 



The first equation (2.1) describes the emission distribution / which depends on the pre- 
dictor variables x t , the global parameter rj) and component specific parameters Q k . We 
assume that / belongs to a parametric class of distributions and assign parametric priors 
for if} and 6 k in equations (2.2) and (2.3). 



At the next hierarchical stage in equation (2.4), we use infinite mixture distribution 



to model the transition distribution P. This mixture distribution is given by the stick- 



breaking process as in equation (2.5). More importantly, P depends on the predictors 
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x t , and this dependence is induced through the weights {n k } k . Accordingly, the 7r fc 's are 
modeled using a probit stick breaking process in equation ( |2.6[ ) where $ is the Gaussian 
CDF and ctjf., /Vs are parameters of the probit stick breaking process. The function 
h(x t ] x*), introduced to model the influence of x t on the state dynamics, is specified as 
h(x; x*) = — \\x — xl\\ 2 , where {x* k }k are random locations on the predictor space X . It is 
assumed here that the components of a multivariate predictor are all standardized to bring 
them to a common scale. As \\x — x k \\ —> 0, $(<X/fc + f3kh(x,x k )) — > $(0;^). Therefore 
the maximum probability of transition from state j to state k is attained when x = x* k . 
Restricting the f3us to be positive, as x goes away from x* k1 $(«jfc + (3kh(x, x\)) — > 0, 
i.e. the probability of a transition to the k-th state decreases to 0. Other parameters 
remaining fixed, the conditional probability of making a transition from j to k increases 
with increase in <x,- fc . Larger values of /3 k result in faster decay of the probability of a 
transition to state k as the associated predictor value goes away from x* k and thereby 
implies that smaller regions of the predictor space around x* k favor the latent state k. 
Finally in equation (2.7), the prior for the parameters of the stick breaking process 
has been specified as p ((x, (3,x*). The above specification does not allow different (3 k 
components for different components of a multivariate predictor, leading to a sparse 
model. This restrictive assumption can be relaxed when large number of data points are 
available. 

To specify the initial distribution of Z\, we introduce a special initial state Zq that 
is always instantiated at a special value z = 0. Po( a3 i) — -P(0, £Ci) is then specified as 
P(0,xx) = J2T=i 7Tfc(0,a3i)4, with vr fe (0,a3i) = $(a fc + PkH x i'> x t)) YliZi I 1 ~ ®{ a oi + 
(3ih(xi; a^)) }, a form that allows inclusion of likelihood contribution from the first output 
variable y\ in updating f3i- Zl and x\. Zi a-posteriori. 

The proposed model, when the emission distribution is only implicitly influenced by 
the associated predictor value x t through z t , i.e. f(y t ;x t ,if),6 Zt ) = f(y t ;if),6 zt ), will be 
referred to as iNHMMl. The more general model, where the predictor directly influences 
the emission distribution will be referred to as iNHMM2. 



3 Exact MCMC Sampling from the Posterior 



This section describes the exact MCMC procedure to draw samples from the posterior us- 
ing auxiliary variables. The original algorithm for fitting infinite dimensional DPMM by 



Escobar (1988) and several notable variations of it, for example, MacEachern (1994), Es 



cobar and West (1995), Neal (2000) rely on integrating or 'marginalizing' out the random 



probability measure and work with the associated Polya urn characterization. Recent de- 
velopments have focused on sampling techniques that escape the need of integrating out 
the random probability measure. The approximate Gibbs sampler based on truncation 



by Ishwaran and James (2001), the exact retrospective sampler of Papaspiliopoulos and 



Roberts (2005) and the exact slice sampler of Walker (2007) and its extension by Kalli 



et al. (2011) to a more efficient version are significant contributions in this direction. The 



basic idea in slice samplers is to use random truncation rather than fixed truncation - 
to limit the space of cluster assignment variables to a random but finite size for each 
MCMC iteration through introduction of auxiliary slice-variables. By circumventing the 
necessity to marginalize out the random probability measure, these methods (or their 
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straight-forward extensions) also allow efficient posterior computation for many different 
types of stick-breaking processes. 

In the context of HMMs, an efficient recursive forward-backward (FB from here on- 
wards) sampler, was originally developed by Baum et al. ( 1973 ) for efficient execution of 
an EM algorithm. In a Bayesian paradigm direct Gibbs samplers can be implemented 
for posterior computation in HMMs. Stochastic versions of the FB sampler, lead to an 
alternative Gibbs sampling strategy that out-performs direct Gibbs sampler in that it 
results in more rapid mixing and less sample auto-correlation (Scott, 2002). Unfortu- 



nately the FB sampler can not be applied directly to HMMs with infinite state-space. As 
in the case of infinite mixture models, approximate sampling techniques based on finite 
truncation of the state-space and exact Gibbs samplers based on marginalization of the 



random probability measures can be developed for iHMM (Teh et al., 2006 Fox et al. 



2008). Introducing auxiliary slice- variables, the number of trajectories of latent sequences 



with positive probabilities for each MCMC iteration can be reduced to a finite size. Beam 
sampling, an efficient exact MCMC procedure for drawing samples from the posterior of 
iHMM, developed by Van Gael et al. (2008), builds on this idea and integrates together 
the FB and the slice sampling techniques. The procedure can be extended for exact 
sampling from the posterior of infinite dimensional HMMs with transition distributions 
constructed through stick-breaking processes. 

The MCMC procedure to be described here is developed by fusing together modified 
versions of the FB sampler of Chib (1996), the slice sampler of Kalli et al. (2011) and 



the auxiliary variable sampler of Chung and Dunson (2009) (see also Albert and Chib 



1993a). More specifically, the latent sequence Z\-t and the parameters {6 I -,}°^ 1 specifying 
the emission distribution are updated through a beam sampler and the parameters de- 
termining the transition probabilities are updated through an auxiliary variable sampler 
which is described in the following sections. 



3.1 Introduction of Auxiliary Variables 

We introduce a set of latent variables U\-t where the unconditional distribution of each 
Ut is uniform on the unit interval. For any positive sequence £ = {£,k\T=i e [0> 1]> we can 
write 



p(Ut, Z t = k\Zt-i = j, 7T k (j, X t ),£k) = l(«t < 6c) 



(3.1) 



The sequence is typically a deterministic decreasing sequence, although random 
sequences are allowed (see Kalli et al., 2011, for more details). This implies 



p{u t \z t = k, Zt-i = j, 7T fc (j, Xt),£k) 
p(zt = k\u t , Zt-i = j, 7T k (j, X t ),£k) 



p(u t , Z t = k\zt-! = j, 7T fc (j, x t ), f fc ) l(u t < £ k ) 



p{z t = k\z t -i = j, ir k (j, x t ), f fc ) 

p(u t , z t = k\zt-i = j, Tc k (j, x t ),£ k ) 

p{u t \z t -\ =j,nk(j> x t),€k) 
ir k (j,x t ) 
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oc l(k : u t < £ k ) 
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(3.2) 



(3.3) 
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Given z 1:T we also introduce latent auxiliary variables = {Wzt-x,i,t\Ti=\ wnere 



{W jk)t \a jk , fl k , x t ) ~ d N(a jk + (3 k h{x t ; x%), 1), (3.4) 
{z t = k\zt-i = j, x t } iff {W jk , t > and W jht < for I = 1, 2, . . . , k - 1}. (3.5) 

In what follows, £ denotes a generic variable that collects all the parameters that are not 
explicit. 

3.2 Updating the Latent State Sequence 

The recursive algorithm implemented in this article for updating the latent sequence, is 
a backward-forward (BF) sampler, a trivial variation of the FB sampler. 



Updating the latent sequence u\ : t'. From equation (3.1), we have 



( I -* £ I I t\ ^ U t < Set) /o «\ 

p(t*t|« , *i:T, Vl:T, X 1:T , C) = P(Ui|*t, £) = T • (3.6) 

• Updating the latent sequence Z\-^\ 

Define the backward messages (3 t (z t ) = p(y(t+i)-.T\ z t, x^+iy.T, u {t+i)-.T, C), with the bound- 
ary condition 0t(zt) — 1- The following recursion holds - 

Pt(*t) = P(y(t+l):T\zt, X( t+1 y. T , U( t+1 y T , C) 

= J^P(2/(t+l):T, Z t+1 \z t , SC( t+ i):r, U( t+ i). T , C) 

= y]p(y(t+2):T|Zt+l, g(t+2):T, "(t+2):T, P(Vt+l W+l , SBfrfl , C) P^t+l |«t, #t+l> «t+l» C) 

zt+l 

= ^2 /3t+i(z t+1 ) p(zt +1 \zt,x t+1 ,u t+ i,C) p(yt+i\zt+i,x t +i,C)- ( 3 - 7 ) 

Zt+l 



From equation ( 3.3[ ), it follows that, given it t+1 and the sequence the set of possible 



values of z t+ i, with p(z t+ i\z t , x t+ \, u t+ i, £) > 0, is finite and thus the above sum is to be 
taken only over finitely many values of Zt+i- The joint conditional posterior distribution 
of the latent states could be factorized as 

P(ZI;t\VI:T, U 1:T , Xi: T , £) = P{Zt\ZT-1,V1:T, «1:T, X UT , £) 

• • • p(z 2 \z 1 ,y 1:T ,U 1:T ,X 1 . T X)p(Zl\yi:T,Ul:T,X 1:T X), 



where p{z t \z t ^y 1 , T ,x x , T , u UT , Q oc p(z t , yurlyut-i, Zt-i, u 1:T , x 1:T , Q 

P(y(t+l):T\ z t, Zt-l, X 1:T , Ui:T, C) P{Vt\zt, Zt-l, X V . T , U i:T , C) 

X p(z t \z t -l, J/l:(t-l), JCijT, Ui :T) C) 

oc /3 t (z t )p(y t \z t , x t , C)p(zt\z t -i,x t , u t , C)- (3.8) 

To sample Zi : t from its full conditional we first pass messages (3t{ z t) backwards and then 
sample forwards. 
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3.3 Updating the Parameters of the Emission Distribution 

Conditional posterior distribution of the global parameters if) is given by 

T 

p(i/>\z 1:T , y 1:T , x 1:T , C) oc p (if)) Yl f(yt, x t , if>, Zt ). (3.9) 

t=i 

Conditional posterior distribution of cluster specific 6k is proportional to 

p(O k \z 1:T ,y 1:T ,x 1:T ,C) ocp o (0fc) IJ f(Vt> x t,i/>,Ok)- (3-10) 

{t:z t =k} 

If the set {t : z t = k} is an empty set, i.e. no observation is associated with latent state k, 
then p(0k\zi;T, Vi-.T, Xi-.t, C) is j us t proportional to its prior. The conditional posterior can 
be simplified if the family of distributions {po{6) : G 0} is conjugate for the emission 
distribution f(yt] x t , 6). 

3.4 Updating the Parameters of the Transition Distribution 

The parameters of the transition distribution can be updated through Gibbs sampler. 

• Updating auxiliary variables: For any Wjij G we have - 

(Wj-i )t |Zi :T ,yi:T,a!l:T,C) ~ [l(z t = l,Z t ^ = j) X N + (lV jl)t ] 01 ji + Plh{x t \ X*) , 1) + 

l(z t > l,zt-! =j) x N_(w jljt ;a jl + pih(x t ;x^),l)], (3.11) 

where N + (fi,cr 2 ) and iV_(/i, a 2 ) denote truncated Normal densities with location /j, and 
scale a truncated below and above zero respectively. That is 

p{W 3l , t \z, T , y 1:T , x 1:T , C) = I N _ {aji + mx . ^ 1} if Zt > Zj ^ = . (3.12) 

• Updating cc^'s: This implies - 
p(a|iw, zi :T ; j/i : t; x 1:T ; C) 

T r r Zt ~ 1 

Po(oc)J[ (f)(w Zt _ 1:Zut ;a Zt _ ljZt +f3 Zt h(x t ;x* t JJ ^{w Zt _ lU ]a Zt _ ltl +(5ih(x t ]X^),l) 



t=i 



p(aji\w,Z 1: T,yi:T,Xl:T,C) <XPo(<Xjl)\ j] <j>{Wjl,t\ <Xjl + PMx t ] X*) , 1) !• 

^ {t: z t =l,z t -i=j} ' 

x J JJ (f>(wji 7 t;aji + pih(x t ;xt),l) > 



{t: z t >Z,z t _i=j} 

oc Po(«j«) I [ <t>{wji,t] oiji + fiih{x t \ x*), 1) 



{t: zt>/,2t-l=j} 
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If we assume independent Normal prior for a^s i.e. assume a-priori otji ~ po(oiji) = 
N(otji] fi a , (j^) then we have 



p(a j i\w,z l .. T ,y l . T ,x l .. T X) = N(aji ;n ajl *,cr 2 ), 



(3.13) 



where // Q . ; * = a 2 



*>i,*-i=,}H'.* - PrHx t ; x*)} + g 



> = K + O' and 



n ii = Et=2 1 (^ > l,Zt-l =])■ 

• Updating ^'s: Similarly for the full conditionals of /3/'s we have - 

p{Pi\w,z 1:T ,yi:T,x 1:T ,C) ocp (/3i)l Yl <t>( w z t -i,i,f,az t - 1 ,i + (3ih(x t ;x*),l) \ 

^ {t: z t =l} 

x I II < />( w *t-Li,f,a* t - 1 j + Pih(x t ;xt),l) 

^ {t: z t >l} 

^Po(Pl) IJ <t ) ( W zt-l,l,ti a zt-l,l + PM X t,X*),l) 
{t: z t >l} 

If we assume independent truncated Normal prior for $'s i.e. assume a-priori /3i 

Po((3i) = N+(Pi; /ip, of) then we have 



ind 



p(Pi\w,z 1:T ,y 1:T ,x 1:T ,C) = ;/i A *,aL), 



(3.14) 



where ^ = a\ *{ V^+E{t:z t >o K x u x^){w Zt _ lU -a Zt _ ul )}, a fi 2 = { V+E{^>;} 
• Updating a^'s: The full conditionals of a?*'s are given by - 



p(x*\w,Z 1: T,yi:T,X 1:T ,C) OC 



t=i 



p(x*)H 0( 

w z t -i,z t ,ti a zt-i,zt 4>{w zt _ u i, t ] &z t -i,i+Pih(x t ; x*), 1) 



zt-l 



i=i 



p(x*\w, z 1:T , y 1:T , x 1:T , C) oc p(atf) I Yl 4>{wz t -uht, Uzt-i,i + PiK x t\ 1) > 

^ {t: z t =l} * 



{t: z t >«} 

^p( x t) JJ <f>(w Zt _ u i, t ] a Zt _ u i + foh(x t ] xf), 1). 

{t: zt>/} 

Assuming p(a?*) to be uniform over a discrete set of possible values of x* on the predictor 
space X, the above conditional posterior is a multinomial distribution and therefore can 
be easily sampled from. 
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4 Prediction 



Assume that true values of x^+i) : (T+n) are known (for example if {x t } is a deterministic 
sequence, or if it affects the response series {y t } with some time lag > n) or can be 
estimated with high precision. Collecting the transition and the emission parameters in 
£, the predictive density for yr+ n can be written as 

fT+t{yT+n\yi-.TiX UT+n) ) = J p{yT+n\yi;T,X 1: T,Z 1:T ,X ( T +1 y. ( T + n),C) dP{C, *l:T|l/l:T, X\ir) 
= J ^ P(yT+nW+n, XT+n, C) ^ P(z T +n W+n-l , X T+n , C) ^ K^T+n-l I^T+n-2, ^T+n 

2T+n ZT+n~l ZT+n-2 

••• ^ P(Z T+2 \ZT + 1,X T+ 2,C) P(zT+i\zT,X T +1,C) dP(C,Zl:T\yi:T,Xi :T ). (4.1) 

*r+i=l 

Exact evaluation of the predictive density would be computationally very challenging as 
it involves multiple integral over the parameter space w.r.t. a complex joint posterior 
density. Monte Carlo integration techniques, however, can give simple approximation 
formula. We use the notation LHS=RHS to signify that LHS is being estimated by 
RHS. Its actual significance being implicitly understood, we will henceforth refer to 
fT+t(yT+n\yi:T,x 1:iT+n) ) simply as Jr+H(VT+n)- A Monte Carlo estimate of f^(y T+n ) 
is given by 

fr+ZiVT+n) = T?X] P(yT+n\z T +n,X T+n X im) ) Yl P^T+n\z T +n-l, X T +n, < (m) ) 

m=l 2 T+n =l z T+n _i=l 

■■■ p(z T+ 2\zT + l,X T+2 X im) )p(zT + l\z^\x T+1 ,C {m) ) 

*T+1=1 

1 M 

= M Yl f (m) (yT+n\yi:T, X hT , X {T +l):(T+n)) = fr+ZiVT+n), Say, (4.2) 
m=l 

where and £( m ) are sampled values of Z\ : t and £ from m-th MCMC iteration, and 
Jf(™) = maxzg. 

5 Examples 

The methodology is illustrated through simulation experiments and a real world appli- 



cation. In Section 5.1| predictive performances of the two types of NHMM models are 



judged using synthetic data sets. An epidemiological application is presented in Section 

For all these examples we specify the hyper-priors as follows. Hyper-priors for the 
parameters of the emission distribution depend on the family the emission distribution 
comes from and also on the particular application at hand. Once the hyper-priors for the 
emission parameters have been specified, to facilitate convergence, we recommend that 
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a Dirichlet process mixture model (DPMM) be fitted to the y t values with likelihood 
function the same as the emission distribution but ignoring the time dependence and the 
influence of the predictors on the latent states. Latent states Z\_t could then be initialized 
at the cluster labels of the DPMM after sufficiently large number of MCMC iterations. 
Parameters of the emission distributions can be similarly initialized at the cluster specific 
parameter estimates from the DPMM. For univariate x t , the set of possible values of 
x\ can be taken to be {Q(x,p) : p = 0.1,0.3, . . . ,0.99}, where Q(x,p) denotes the p-th 
percentile of X\.t- x^s could be instantiated at argmin ^2^ t . Zt=k y \x t — x*\. Based on 
experience with simulation studies we also recommend that the prior hyper-parameters 
for Ojfc's and /Vs be set at fi a = 2, a a = 1,/ig = 2, erg = 2/3. The parameters a^'s and 
/3fe's could all be instantiated at their respective prior means. 



5.1 Simulation Experiments 

Simulation Design for iNHMMl: The sequence xt was generated through an AR(1) 
process x t = 0.95a;t_i+e t , where e t ~ A(0, 1). x t values were then standardized. The state 
space for the latent variables was Z — {1, 2, . . . , 5}, with associated x* values the 50-th, 
15-th, 85-th, 2-nd and 98-th percentiles of X\ : t respectively. Transition probabilities were 



calculated using (2.6)( the '=' sign now replaced by 'oc') with parameters ajk = 2, if j = k 
and ajk = 0.5, otherwise for all j, k e Z; (3k = 2 for all k G Z. Emission distribution for 
y t , given z t = k, was N(^i yt k, u y k) with a y ^ = 0.25 for all k and = 0, —2, 2, —4,4 for 
k = 1, 2, 3, 4, 5 respectively. Values of a y ^ were all equal to 0.25, so this parameter could 
have been treated as a global parameter. But while fitting the model cluster specific 
variances were allowed to be different. Conjugate Normal-Inv-Gamma(/io, ctq/^o, 7o, ctq) 
prior was assigned on (fj, yt k, a y,k) with /io = 0, u = 0.10 and 70 and were so chosen 
that the prior mean and sd of were 0.20 and 1 respectively. Two different sample 
sizes, T = 250 and T = 500, were considered. In each case predictor and response values 
for three additional time points were also simulated. The first T data points were used 
for fitting the model and in each case up to three steps ahead predictive densities were 
estimated. A total of B = 100 data sets were generated using this design. 10, 000 MCMC 
iterations were run in each case and initial 3, 000 iterations were discarded as burn-in. 

The mean integrated squared error (MISE) of n-step ahead prediction is defined as 
MISE = E J {fx+tiy) ~ fT+n(y)} 2 dy, which can be estimated by 

1 B N 

MISE = MISE est = 1 J2 Et/ST 1 ^) - /?]rV)} 2 A<, (5-1) 

6=1 i=l 

where {y^}£L are a se * of grid points on the range of y and A« = (yf 1 — y^ x ) for all 
i. Since a few extreme values can make the estimated MISEs large, 25-th and 50-th 
percentiles are also reported. The predictive performance is contrasted with that of an 
infinite dimensional homogeneous HMM, which will be referred to as iHMMPl, obtained 
by replacing all /Vs by zeros in iNHMMl. The iHMMPl formulation is based on simple 
probit stick breaking process and does not model the the influence of the predictor on 
the state dynamics. 

Simulation Design for iNHMM2: In case of iNHMM2 the emission distribution 
involves the predictor x t explicitly. Number of states, in this case, was fixed at 3 with 
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Figure 3: Prediction performance of iNHMMl: Three-step ahead true (green) vs estimated 
(blue) predictive densities for one simulated data set with sample size T= 500. The stars 
(red) represent true y values. 



Sample Size 


Steps ahead 


25th Percentile 


50th Percentile 


MISE est 


iNHMMl 


iHMMPl 


iNHMMl 


iHMMPl 


iNHMMl 


iHMMPl 




1 


0.0019 


0.0102 


0.0054 


0.0638 


0.0195 


0.1140 


T = 250 


2 


0.0020 


0.0245 


0.0067 


0.0781 


0.0247 


0.1696 




3 


0.0021 


0.0309 


0.0068 


0.1244 


0.0296 


0.2053 




1 


0.0011 


0.0128 


0.0039 


0.0403 


0.0099 


0.1068 


T = 500 


2 


0.0011 


0.0323 


0.0050 


0.0831 


0.0125 


0.1655 




3 


0.0012 


0.0417 


0.0050 


0.1375 


0.0135 


0.2037 



Table 1: Prediction performance of iNHMMl: Summary of one, two and three steps ahead 
prediction performance for simulation experiments in terms of MISEs 



associated x* values taken to be the 50-th, 10-th and 90-th percentiles of Xx-t- Emission 
distribution for y t , given z t = k and x t , was taken to be N(r]i k + r] 2 ^Xt, (Jy) with a y = 1 
(global parameter) and rf k = (rji^, V2,k) = (1, 1), (0, —2), (2, 4) for k — 1, 2, 3 respectively. 
In this case yt values were also standardized before fitting the model. We assumed 
N 2 (f]o,h) prior for r\ k with x]§ set at the least square estimate of t\q fitting a simple 
regression model y t = r/o + ^i^*- The prior for was Inv-Gamma(7o, (Jq) with prior mean 
and sd set at 0.20 and 1 respectively. Again the predictive performance is contrasted with 
that of a homogeneous HMM, referred to as iHMMP2, obtained by replacing all /3k 8 by 
zeros in iNHMM2. 

Simulation designs were carefully constructed to ensure diverse shapes of predictive 
densities. Figures [3] and [4] represent two such simulation experiments. The variety of 
shapes the model can capture should particularly be noted. Numerical summaries of 
prediction performance for one, two and three steps ahead prediction are presented in 
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Predictive Density of Y T 




Figure 4: Prediction performance of iNHMM2: Three-step ahead true (green) vs estimated 
(blue) predictive densities for one simulated data set with sample size T= 500. The stars 
(red) represent true y values. 



Sample Size 


Steps ahead 


25th Percentile 


50th Percentile 


MISE est 


iNHMM2 


iHMMP2 


iNHMM2 


iHMMP2 


iNHMM2 


iHMMP2 




1 


0.0025 


0.0411 


0.0082 


0.0920 


0.0254 


0.1403 


T = 250 


2 


0.0029 


0.0678 


0.0086 


0.1230 


0.0247 


0.1763 




3 


0.0037 


0.0546 


0.0110 


0.1266 


0.0325 


0.1906 




1 


0.0018 


0.0176 


0.0048 


0.0397 


0.0130 


0.0959 


T = 500 


2 


0.0016 


0.0273 


0.0069 


0.0704 


0.0131 


0.1419 




3 


0.0021 


0.0399 


0.0052 


0.0998 


0.0136 


0.1736 



Table 2: Prediction performance of iNHMM2: Summary of one, two and three steps ahead 
prediction performance for simulation experiments in terms of MISEs 



Table [T] (iNHMMl) and Table [| (iNHMM2). From the tables it can be clearly seen 
that modeling the influence of the predictor on the dynamics of the latent variables, 
produces much improved estimates of the predictive density. MISEs have also been 
reduced significantly in these situations. Although a general increasing trend in MISEs 
of one, two and three steps ahead predictions may be expected, since the uncertainty of 
the predictive distributions also depend on the associated predictor values, this may not 
always be the case. 



5.2 Application in Prediction of Malaria Epidemics 

We use the malaria data set used by Laneri et al. (2010), |Bhadra et al. (2011) in order 
to demonstrate the effectiveness of our proposed methodology. Figure 5] displays the 
monthly confirmed cases of P. falciparum and monthly rainfall in the district of Kutch, 
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an arid region in the state of Gujarat in Northwest India, between January, 1987 and 
December, 2006. The record of the monthly accumulated malaria cases are maintained 
by the National Institute of Malaria Research in India, and was originally compiled by 
the office of the District Malaria Officer. The monthly accumulated rainfall time series 
was obtained from a local weather station run by the Indian Meteorology Department. 
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Figure 5: Monthly reported P. falciparum malaria cases (solid line) and monthly rainfall 



from a local weather station (broken line) for Kutch, adapted from Bhadra et al. (2011). 
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Figure 6: Malaria data set: left pane shows the time series of standardized accumulated 
rainfall (black) and standardized monthly malaria cases (blue) from January, 1987 to De- 
cember, 2006; right pane shows the scatterplot of standardized accumulated rainfall (x-axis) 
vs standardized monthly malaria cases (y-axis). 



Variability in climate factors can explain a significant share of the variability in re- 
gional malaria incidence time series. Because the district of Kutch is located in desert 
region, a region with extreme climate conditions located at the edge of the geographical 
distribution of the disease, climate variables such as rainfall are expected to be relevant 



to disease dynamics (Laneri et al, 2010). This malaria time series also shows signs of 
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"epidemic" malaria, where the disease peaks in the winter months and typically dies out 
at the end of the winter. This is at a contrast with "endemic" malaria where low level 
infection persists throughout the year. Visually, an apparent lag relationship between 
rainfall and reported malaria cases is evident. A characteristic of the monsoon climate in 
this geographic region in India is that the rainfall typically peaks during summer mon- 
soon season leading to peaks in malaria several months later during dry winter seasons. A 
strong correlation of 0.84 between total monsoon rainfall (aggregated over June-August) 
and total winter malaria cases (aggregated over October-December) suggests a signifi- 
cant causal relationship (Bhadra et al., 2011). To check the correlation between monthly 
disease incidence (as opposed to aggregated disease cases over a few months) and the 
rainfall covariate, we used another window to accumulate rainfall over the past 4 to 6 
months from the present month and then shifted the accumulated rainfall by a forward 
lag of 1 or 2 months. This resulted in a maximum correlation of 0.72 when we excluded 
one outlier at September, 1989. 

Given these two time series, primary interest lies in the prediction of malaria epidemics 
for the winter season of a given year given the rainfall covariate until the month of 
September of that year. This will enable early preventive measures to be taken, if an 
epidemic is suspected. Use of hidden Markov models for modeling disease dynamics and 
epidemic prediction can be found in the literature. Rath et al. (2003), for example, used 
an HMM to characterize the non-epidemic and epidemic dynamics in a time series of 



influenza like disease incidence rates. See also Strat and Carrat (1999); Watkins et al. 



(2009); Conesa et al. (2011). 



We considered all the four models - iNHMMl, iHMMPl, iNHMM2 and iHMMP2, 
with rainfall accumulated over 5 months and shifted forward 2 months as the predictor 
and monthly malaria cases as the response. Covariate values for the first six months, 
January to June of 1987, were taken to be the average of that month, averaged over 
the remaining years. Henceforth the predictor will simply be referred to as accumulated 
rainfall. In our monthly time series of disease cases spanning over the 20 year period 
from January, 1987 to December, 2006, we define a particular year to be an epidemic 
year if the accumulated disease cases in that year is greater than the 75-th percentile of 
aggregated yearly cases, where the quantiles are computed based on the data from all 
of the 20 years (Table 3, Laneri et al. ( 2010[ )). Predictive performances of the models 
were evaluated for three different situations - uneventful summer months, winter months 
of an epidemic year and winter months of a non-epidemic year. The models were, thus, 
fitted to three different subsets of the malaria data set - 1. first subset consisting of 
data points from January, 1987 to March, 2006; 2. a second subset consisting of data 
points from January, 1987 to September, 2003 (an epidemic year); and 3. a third subset 
including data points from January, 1987 to September, 2006. Recent years 2003 and 
2006 were chosen to make the number of data points, used for fitting the models, the 
maximum available in each case. In the three cases considered, respectively T = 231, 201 
and 237 data points were available for model fitting. Predictor values and responses 
were standardized and in each case predictive densities were estimated for the following 
three months. Thus, in case of the first subset, predictive densities were estimated for 
the summer months of April, May and June, whereas in the latter two cases, predictive 
densities were estimated for the winter months of October, November and December. The 
lag effect of two months implies that xt+i and xt+2 are exactly known. But to calculate 
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xt+3, rainfall of the (T + l)-th month is required. In each case, this was estimated to 
be the average rainfall of that month calculated from the subset of the data used to fit 
the models. Note that the actual predictor is rainfall accumulated over five months and 
the rainfall of only one component month is required to be estimated. Also note that the 
monthly rainfall of the (T + l)-th month is actually available from the complete data but 
the models were never allowed to use 'future' observations. For iNHMMl and iHMMPl 
models, given z t , N(p V)Zt , o% Xt ) emission distribution with conjugate but diffuse Normal- 
Inv-Gamma(2, 10, 3, 1) prior for the emission parameters were fitted. For iNHMM2 and 
iHMMP2 models, given z t and x t , N(r] 0tZt + r]x tZt x t ,(Ty^ t ) distribution with non-conjugate 
diffuse N 2 (r} , J 2 ) x Inv-Gamma(7 , <Tq) priors for rj k = (r] 0jk , i]i k )' and a yk were fitted. rj 
was set at the least square estimate of T70 fitting a simple regression model y t = r] + rjiXt 
and prior mean and sd of a 2 y k were set at 0.5 and 1. Increasing variability of malaria cases 
with increase in accumulated rainfall can be seen from Figure [6j prompting us to consider 
conditionally heteroscedastic emission distributions. For priors for the parameters of the 
transition distributions and initialization of the MCMC chain, we refer the reader to the 
beginning of Section |5j The scatterplot of accumulated rainfall vs malaria cases does not, 
however, show any locally varying linear relationship. Indeed iNHMMl and iNHMM2 
(and similarly iHMMPl and iHMMP2) models produced very similar results (model fits 
and predictive densities) in all three situations. Results for only the more parsimonious 
iNHMMl and iHMMPl models are, therefore, presented. 
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Figure 7: Prediction results for summer of 2006: The larger window at the top represent 
yt series (green) used to fit the model and the posterior mean sequence (blue) estimated 
by iNHMMl. The three smaller panes at the bottom show the three-step ahead estimated 
predictive densities for (standardized) yt series using iNHMMl (blue) and iHMMPl (black) 
for the months of April, May and June of 2006. The stars (red) represent true (standardized) 
yt values. 



As can be seen from the Figure [7] and Figure [8j for predicting malaria cases for the 
months of April, May and June of 2006 the models with and without the covariate produce 
almost identical results. However the model without accumulated rainfall as covariate 
performs poorly in the more important case of estimating the predictive distribution of 
malaria cases for the months of winter (October, November and December). Because of 
the presence of only a few sharp peaks in the entire data set, the model without covariate 
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Figure 8: Prediction results for winters of 2003 and 2006: The larger window at the top 
represent yt series (green) used to fit the model and the estimated posterior mean (blue) 
sequence. The three smaller panes at the bottom show the three-step ahead estimated 
predictive densities for (standardized) yt series using iNHMMl (blue) and iHMMPl (black) 
for the months of October, November and December of (a) 2003 (an epidemic year) and 
(b) 2006 (a non-epidemic year) respectively. The stars (red) represent true (standardized) 
yt values. 

assigns, irrespective of the previous states, very small probability of transitions to states 
that favor large number of monthly malaria cases and large variance. Conditionally given 
large values of the predictor, the iNHMMl model, however, increases the probability of 
a transition to states favoring large number of malaria cases and gives more realistic 
estimates of the predictive distribution (and associated uncertainty) of malaria cases in 
winter. 

6 Discussion 

In this article two variations of NHMMs are proposed based on flexible Bayesian non- 
parametric predictor dependent infinite mixture models. Efficient algorithms for exact 
posterior computation were developed. The proposed methodology is able to produce 
the full predictive distributions, instead of providing only point prediction estimates. 
Furthermore, this methodology is flexible enough to accommodate multivariate predictors 
and responses as well as a wide variety of emission distributions including distributions 
for discrete responses. 

The model, introduced in this article, inherits all the strengths and limitations of 
HMMs and predictor dependent infinite mixture models. Framework of HMMs, makes 
the model applicable to situations when the dynamics over time space could be non-linear. 
Use of predictor dependent infinite mixture models, on the other hand, encompasses mod- 
eling of scenarios when the change in the shape of the predictive distribution with change 
in values of the predictor may not follow standard parametric laws. Efficient recovery 
of widely varying predictive densities in simulation experiments and an important epi- 
demiological application illustrate the flexibility and scope of the proposed methodology. 
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On the other hand, since the methodology attempts to model dynamical systems with 
complex dependence relationships between the predictor and the response, moderately 
large number of observations may be required. 

The MCMC simulation scheme presented in the paper was exact but does not allow 
online learning of the dynamical system being modeled. Inclusion of new data points 
would necessitate refitting of the models. Since the problem of malaria epidemic pre- 
diction, described in this paper, required that predictions be made on a monthly basis, 
the computational cost of refitting the models was not an issue. For applications, where 
online prediction is of importance, sequential Monte Carlo methods can be developed 
for these models. Ongoing and future research projects also include applications of the 
methodology developed here in the fields of biology and bio-informatics and an extension 
to jointly model the transition dynamics and emission distributions within a nonpara- 
metric framework. 
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